EMU: A Web Portal Generator

Authors

  • Aaron R. Bradley
  • Andrew M. Bradley
Abstract

A web portal is a set of HTML documents that form a tree rooted at the portal entry site; the leaves are links to other websites. Each node within the tree represents a cluster of documents, and the clusters become less general as one descends the tree. Portals are useful for organizing large repositories of documents, either centralized or distributed, and for enabling users to search them. Whereas search engines require the user to specify search criteria, typically as key words, to search an opaque document bank, a portal visualizes the document bank and can be particularly useful if the query is not amenable to representation by key words.

Popular portals, such as those hosted by AltaVista, Google, Yahoo, and dmoz.org, are created by hand. For example, the Open Directory Project at dmoz.org enlists volunteers to manage subtopics. In contrast, EMU is an automated system that uses statistical NLP and unsupervised machine learning techniques to generate a web portal. Although the quality of an automatically generated portal will certainly be lower than that of a hand-built one, less frequented document repositories and Usenet groups may benefit from an automated system. Additionally, EMU is practical for generating daily navigators of newspapers, internal company documents, and other dynamic document banks. Many companies, for example, provide employees with daily or weekly summaries of domain-specific periodicals relating to the company's business; EMU would provide a practical alternative. Finally, hierarchical clustering can serve as a preprocessing step for fast online clustering used in document retrieval and user queries.

Previous work has used hierarchical clustering for navigational purposes. Scatter/Gather [3] uses a clustering approach to summarize a set of documents; however, it adopts an interactive approach with the user. A typical session consists of (flat) clustering a set of documents and labeling the clusters, allowing the user to select a cluster to explore, and repeating the process on the new cluster. This corresponds directly to the EMU interface of selecting superclusters to explore in more depth, although the interactive approach allows the system to take advantage of fast non-hierarchical clustering algorithms. EMU, however, runs only once to generate a summary; hence, it is appropriate for generating summaries that many people will read via the web.

This report is organized as follows. Section 2 presents an overview of the EMU system, and Section 3 considers EMU's document representation and learning algorithms. Section 4 presents our testing techniques and evaluates our results on a test set. Finally, Section 5 concludes with ideas for applications and future work.
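As a rough illustration of the portal-as-tree idea described above, the following Python sketch clusters a few documents bottom-up and renders the resulting tree as nested HTML lists. It is only a sketch under stated assumptions: the bag-of-words representation, cosine similarity, the greedy merge rule, and the names `vectorize`, `cosine`, `build_tree`, and `render` are all hypothetical choices made for this example. EMU's actual document representation and learning algorithms are the subject of Section 3 and are not reproduced here.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, simplified representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_tree(docs):
    """Greedy agglomerative clustering over a non-empty dict name -> text.

    Repeatedly merges the two most similar clusters; a merged cluster is
    represented by the sum of its members' term counts. Returns nested
    tuples whose leaves are document names.
    """
    clusters = [(name, vectorize(text)) for name, text in docs.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def render(node):
    """Render the tree as nested HTML list items; leaves are document names."""
    if isinstance(node, str):
        return f"<li>{node}</li>"
    return "<li><ul>" + "".join(render(child) for child in node) + "</ul></li>"


if __name__ == "__main__":
    sample = {
        "a": "cat dog pet",
        "b": "cat dog animal",
        "c": "stock market bond",
    }
    print(render(build_tree(sample)))
```

In this toy input, the two documents that share vocabulary are merged first and the unrelated one joins at the root, which mirrors the idea of clusters becoming more specific deeper in the tree. A real portal would also need cluster labels and links at the leaves, which this sketch omits.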

Publication date: 2001